

 asymptotic risk



Neural Information Processing Systems

As a consequence, empirical risk minimizers generally perform very poorly in extreme regions. It is the purpose of this paper to develop a general framework for classification in the extremes.






The Minimax Risk in Histogram-Based Uniformity Testing under Missing Ball Alternatives

Kipnis, Alon

arXiv.org Machine Learning

We study the problem of testing the goodness of fit of a discrete sample from many categories to the uniform distribution over the categories. As a class of alternative hypotheses, we consider the removal of an $\ell_p$ ball of radius $\epsilon$ around the uniform rate sequence for $p \leq 2$. When the number of samples $n$ and number of categories $N$ go to infinity while $\epsilon$ is small, the minimax risk $R_\epsilon^*$ in testing based on the sample's histogram (number of absent categories, singletons, collisions, ...) asymptotes to $2\Phi(-n N^{2-2/p} \epsilon^2/\sqrt{8N})$; $\Phi(x)$ is the normal CDF. This result allows the comparison of the many estimators previously proposed for this problem at the constant level, rather than at the rate of convergence of the risk or the scaling order of the sample complexity. The minimax test mostly relies on collisions in the very small sample limit, but otherwise behaves like the chi-squared test. Empirical studies over a range of problem parameters show that our estimate is accurate in finite samples and that the minimax test is significantly better than the chi-squared test or a test that only uses collisions. Our analysis relies on the asymptotic normality of histogram ordinates, the equivalence between the minimax setting and a Bayesian setting, and the characterization of the least favorable prior by reducing a multi-dimensional optimization problem to a one-dimensional problem.
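To make the quoted limit concrete, the following minimal numeric sketch (not from the paper) evaluates $2\Phi(-n N^{2-2/p} \epsilon^2/\sqrt{8N})$ at $p = 2$; all parameter values are arbitrary illustrations chosen only to show how the risk moves from near its maximum toward zero as $\epsilon$ grows.

```python
# Hedged numeric illustration of the asymptotic minimax risk formula quoted
# above: R*_eps ~ 2 * Phi(-n * N^(2 - 2/p) * eps^2 / sqrt(8 * N)),
# where Phi is the standard normal CDF. Parameter values are illustrative only.
from math import sqrt
from scipy.stats import norm


def asymptotic_minimax_risk(n: float, N: float, eps: float, p: float = 2.0) -> float:
    """Asymptotic minimax testing risk for an l_p ball alternative, p <= 2."""
    return 2.0 * norm.cdf(-n * N ** (2.0 - 2.0 / p) * eps ** 2 / sqrt(8.0 * N))


if __name__ == "__main__":
    # Risk drops sharply as the ball radius eps grows (problem gets easier).
    for eps in (0.0005, 0.001, 0.002):
        print(eps, asymptotic_minimax_risk(n=10_000, N=100_000, eps=eps))
```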


Asymptotic Risk of Overparameterized Likelihood Models: Double Descent Theory for Deep Neural Networks

Nakada, Ryumei, Imaizumi, Masaaki

arXiv.org Machine Learning

We investigate the asymptotic risk of a general class of overparameterized likelihood models, including deep models. The recent empirical success of large-scale models has motivated several theoretical studies to investigate a scenario wherein both the number of samples, $n$, and parameters, $p$, diverge to infinity and derive an asymptotic risk at the limit. However, these theorems are only valid for linear-in-feature models, such as generalized linear regression, kernel regression, and shallow neural networks. Hence, it is difficult to investigate a wider class of nonlinear models, including deep neural networks with three or more layers. In this study, we consider a likelihood maximization problem without the model constraints and analyze the upper bound of an asymptotic risk of an estimator with penalization. Technically, we combine a property of the Fisher information matrix with an extended Marchenko-Pastur law and associate the combination with empirical process techniques. The derived bound is general, as it describes both the double descent and the regularized risk curves, depending on the penalization. Our results are valid without the linear-in-feature constraints on models and allow us to derive the general spectral distributions of a Fisher information matrix from the likelihood. We demonstrate that several explicit models, such as parallel deep neural networks, ensemble learning, and residual networks, are in agreement with our theory. This result indicates that even large and deep models have a small asymptotic risk if they exhibit a specific structure, such as divisibility. To verify this finding, we conduct a real-data experiment with parallel deep neural networks. Our results expand the applicability of the asymptotic risk analysis, and may also contribute to the understanding and application of deep learning.
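Since the result above singles out architectures with a divisible structure, the sketch below (an assumption, not the authors' code) shows the simplest reading of a "parallel deep neural network": several independent sub-networks applied to the same input, with their outputs averaged. Widths, depth, and the random weights are arbitrary placeholders.

```python
# Minimal sketch of a "parallel" (divisible) architecture: K independent
# ReLU sub-networks whose outputs are averaged. Illustrative only.
import numpy as np

rng = np.random.default_rng(0)


def branch_forward(x, layers):
    """Forward pass of one small sub-network (ReLU on all but the last layer)."""
    h = x
    for W, b in layers[:-1]:
        h = np.maximum(W @ h + b, 0.0)
    W, b = layers[-1]
    return W @ h + b


def parallel_net(x, branches):
    """Average the outputs of all parallel branches."""
    return np.mean([branch_forward(x, layers) for layers in branches], axis=0)


# Four branches, each a 3-layer network on an 8-dimensional input.
dims = [8, 16, 16, 1]
branches = [
    [(rng.normal(size=(dims[i + 1], dims[i])) / np.sqrt(dims[i]), np.zeros(dims[i + 1]))
     for i in range(len(dims) - 1)]
    for _ in range(4)
]
print(parallel_net(rng.normal(size=8), branches))
```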


Asymptotic Risk of Bezier Simplex Fitting

Tanaka, Akinori, Sannai, Akiyoshi, Kobayashi, Ken, Hamada, Naoki

arXiv.org Machine Learning

The Bézier simplex fitting is a novel data-modeling technique that exploits geometric structure in the data to approximate the Pareto front of multi-objective optimization problems. There are two fitting methods based on different sampling strategies: the inductive skeleton fitting employs stratified subsampling from each skeleton of a simplex, whereas the all-at-once fitting uses non-stratified sampling that treats the simplex as a whole. In this paper, we analyze the asymptotic risks of these Bézier simplex fitting methods and derive the optimal subsample ratio for the inductive skeleton fitting. It is shown that the inductive skeleton fitting with the optimal ratio has a smaller risk when the degree of the Bézier simplex is less than three. These results are verified numerically under small to moderate sample sizes. In addition, we provide two complementary applications of our theory: a generalized location problem and multi-objective hyperparameter tuning of the group lasso. The former can be represented by a Bézier simplex of degree two, where the inductive skeleton fitting outperforms the all-at-once fitting; the latter can be represented by a Bézier simplex of degree three, where the all-at-once fitting has the advantage.
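For readers unfamiliar with the model, here is a minimal sketch (an assumption, not the authors' implementation) of evaluating a Bézier simplex: a degree-$d$ map $b(t) = \sum_{|\alpha|=d} \binom{d}{\alpha} t^{\alpha} p_{\alpha}$ from barycentric coordinates $t$ on the simplex to points in $\mathbb{R}^L$, with control points $p_{\alpha}$. The control points below are random placeholders; estimating them from data is exactly what the two fitting methods above do.

```python
# Sketch of Bezier simplex evaluation: b(t) = sum over multi-indices a with
# |a| = d of (d! / prod a_i!) * prod t_i^{a_i} * p_a. Control points are
# random placeholders for illustration.
from itertools import product
from math import factorial, prod
import numpy as np


def multi_indices(m, d):
    """All length-m multi-indices with nonnegative entries summing to d."""
    return [a for a in product(range(d + 1), repeat=m) if sum(a) == d]


def bezier_simplex(t, control, d):
    """Evaluate the Bezier simplex at barycentric coordinates t."""
    t = np.asarray(t, dtype=float)
    out = np.zeros_like(next(iter(control.values())), dtype=float)
    for a, p in control.items():
        coef = factorial(d) / prod(factorial(ai) for ai in a)
        out += coef * prod(ti ** ai for ti, ai in zip(t, a)) * p
    return out


d, m, L = 2, 3, 3  # degree-2 simplex over 3 objectives, embedded in R^3
rng = np.random.default_rng(0)
control = {a: rng.normal(size=L) for a in multi_indices(m, d)}
print(bezier_simplex([0.2, 0.3, 0.5], control, d))
```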


How many variables should be entered in a principal component regression equation?

Xu, Ji, Hsu, Daniel

arXiv.org Machine Learning

We study least squares linear regression over $N$ uncorrelated Gaussian features that are selected in order of decreasing variance. When the number of selected features $p$ is at most the sample size $n$, the estimator under consideration coincides with the principal component regression estimator; when $p>n$, the estimator is the least $\ell_2$ norm solution over the selected features. We give an average-case analysis of the out-of-sample prediction error as $p,n,N \to \infty$ with $p/N \to \alpha$ and $n/N \to \beta$, for some constants $\alpha \in [0,1]$ and $\beta \in (0,1)$. In this average-case setting, the prediction error exhibits a "double descent" shape as a function of $p$.
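The estimator described above is simple to state in code. The sketch below (an assumption, not the authors' code) keeps the $p$ features with the largest variance and then fits ordinary least squares when $p \leq n$ and the minimum-$\ell_2$-norm interpolator when $p > n$; in the paper's setting of uncorrelated features, selecting coordinates by variance plays the role of selecting principal components. The toy data and parameter values are illustrative only.

```python
# Sketch of principal-component / minimum-norm regression over the top-p
# features ordered by decreasing variance. np.linalg.lstsq returns the
# minimum-norm least squares solution, covering both the p <= n and p > n cases.
import numpy as np


def pcr_or_min_norm(X, y, p):
    """Regression on the p features with largest sample variance."""
    order = np.argsort(X.var(axis=0))[::-1]   # features by decreasing variance
    cols = order[:p]
    beta_p, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
    beta = np.zeros(X.shape[1])
    beta[cols] = beta_p
    return beta


# Toy data: n = 50 samples, N = 200 independent Gaussian features
# with decreasing variances.
rng = np.random.default_rng(0)
n, N = 50, 200
X = rng.normal(size=(n, N)) * np.linspace(2.0, 0.1, N)
beta_true = rng.normal(size=N) / np.sqrt(N)
y = X @ beta_true + 0.1 * rng.normal(size=n)
for p in (10, 50, 150):   # under- and over-parameterized regimes
    beta_hat = pcr_or_min_norm(X, y, p)
    print(p, np.linalg.norm(X @ beta_hat - y))   # training residual shrinks to ~0 for p > n
```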


Surprises in High-Dimensional Ridgeless Least Squares Interpolation

Hastie, Trevor, Montanari, Andrea, Rosset, Saharon, Tibshirani, Ryan J.

arXiv.org Machine Learning

Modern deep learning models involve a huge number of parameters. In nearly all applications of these models, current practice suggests that we should design the network to be sufficiently complex so that the model (as trained, typically, by gradient descent) interpolates the data, i.e., achieves zero training error. Indeed, in a thought-provoking experiment, Zhang et al. (2016) showed that state-of-the-art deep neural network architectures can be trained to interpolate the data even when the actual labels are replaced by entirely random ones. Despite their enormous complexity, deep neural networks are frequently seen to generalize well, in meaningful practical problems. At first sight, this seems to defy conventional statistical wisdom: interpolation (vanishing training error) is usually taken to be a proxy for overfitting or poor generalization (large gap between training and test error). In an insightful series of papers, Belkin et al. (2018b,c,a) pointed out that these concepts are, in general, distinct, and interpolation does not contradict generalization. For example, kernel ridge regression is a relatively well-understood setting in which interpolation can coexist with good generalization (Liang and Rakhlin, 2018). In this paper, we examine the prediction risk of minimum $\ell_2$ norm or "ridgeless" least squares regression, under